Demo models for the winter workshop¶

  • Build word2vec models from the texts of Woolf and comparison authors.
  • Load and use pre-built embedding models.
Splitting texts into sentences¶
In [1]:
source("utils.R")
In [2]:
txt <- readLines('data/joyce/1922_ulysses.txt')
print(txt[1:5])
[1] "Part One. The Telemachiad"                                                                                                                                                                                                                               
[2] ""                                                                                                                                                                                                                                                        
[3] "Episode 1. Telemachus"                                                                                                                                                                                                                                   
[4] ""                                                                                                                                                                                                                                                        
[5] "Stately, plump Buck Mulligan came from the stairhead, bearing a bowl of lather on which a mirror and a razor lay crossed. A yellow dressinggown, ungirdled, was sustained gently behind him on the mild morning air. He held the bowl aloft and intoned:"
In [5]:
doclines <- readLines('data/joyce/1922_ulysses.txt')
splitted <- split_even(doclines, 1000)
print(splitted[1:2])
[1] "Part One. The Telemachiad  Episode 1. Telemachus  Stately, plump Buck Mulligan came from the stairhead, bearing a bowl of lather on which a mirror and a razor lay crossed. A yellow dressinggown, ungirdled, was sustained gently behind him on the mild morning air. He held the bowl aloft and intoned: —Introibo ad altare Dei. Halted, he peered down the dark winding stairs and called out coarsely: —Come up, Kinch! Come up, you fearful jesuit! Solemnly he came forward and mounted the round gunrest. He faced about and blessed gravely thrice the tower, the surrounding land and the awaking mountains. Then, catching sight of Stephen Dedalus, he bent towards him and made rapid crosses in the air, gurgling in his throat and shaking his head. Stephen Dedalus, displeased and sleepy, leaned his arms on the top of the staircase and looked coldly at the shaking gurgling face that blessed him, equine in its length, and at the light untonsured hair, grained and hued like pale oak. Buck Mulligan peeped an instant under the mirror and then covered the bowl smartly. —Back to barracks! he said sternly. He added in a preacher’s tone: —For this, O dearly beloved, is the genuine Christine: body and soul and blood and ouns. Slow music, please. Shut your eyes, gents. One moment. A little trouble about those white corpuscles. Silence, all. He peered sideways up and gave a long slow whistle of call, then paused awhile in rapt attention, his even white teeth glistening here and there with gold points. Chrysostomos. Two strong shrill whistles answered through the calm. —Thanks, old chap, he cried briskly. That will do nicely. Switch off the current, will you? He skipped off the gunrest and looked gravely at his watcher, gathering about his legs the loose folds of his gown. The plump shadowed face and sullen oval jowl recalled a prelate, patron of arts in the middle ages. A pleasant smile broke quietly over his lips. —The mockery of it! he said gaily. Your absurd name, an ancient Greek! 
He pointed his finger in friendly jest and went over to the parapet, laughing to himself. Stephen Dedalus stepped up, followed him wearily halfway and sat down on the edge of the gunrest, watching him still as he propped his mirror on the parapet, dipped the brush in the bowl and lathered cheeks and neck. Buck Mulligan’s gay voice went on. —My name is absurd too: Malachi Mulligan, two dactyls. But it has a Hellenic ring, hasn’t it? Tripping and sunny like the buck himself. We must go to Athens. Will you come if I can get the aunt to fork out twenty quid? He laid the brush aside and, laughing with delight, cried: —Will he come? The jejune jesuit! Ceasing, he began to shave with care. —Tell me, Mulligan, Stephen said quietly. —Yes, my love? —How long is Haines going to stay in this tower? Buck Mulligan showed a shaven cheek over his right shoulder. —God, isn’t he dreadful? he said frankly. A ponderous Saxon. He thinks you’re not a gentleman. God, these bloody English! Bursting with money and indigestion. Because he comes from Oxford. You know, Dedalus, you have the real Oxford manner. He can’t make you out. O, my name for you is the best: Kinch, the knife-blade. He shaved warily over his chin. —He was raving all night about a black panther, Stephen said. Where is his guncase? —A woful lunatic! Mulligan said. Were you in a funk? —I was, Stephen said with energy and growing fear. Out here in the dark with a man I don’t know raving and moaning to himself about shooting a black panther. You saved men from drowning. I’m not a hero, however. If he stays on here I am off. Buck Mulligan frowned at the lather on his razorblade. He hopped down from his perch and began to search his trouser pockets hastily. —Scutter! he cried thickly. He came over to the gunrest and, thrusting a hand into Stephen’s upper pocket, said: —Lend us a loan of your noserag to wipe my razor. Stephen suffered him to pull out and hold up on show by its corner a dirty crumpled handkerchief. 
Buck Mulligan wiped the razorblade neatly. Then, gazing over the handkerchief, he said: —The bard’s noserag! A new art colour for our Irish poets: snotgreen. You can almost taste it, can’t you? He mounted to the parapet again and gazed out over Dublin bay, his fair oakpale hair stirring slightly. —God! he said quietly. Isn’t the sea what Algy calls it: a grey sweet mother? The snotgreen sea. The scrotumtightening sea. Epi oinopa ponton. Ah, Dedalus, the Greeks! I must teach you. You must read them in the original. Thalatta! Thalatta! She is our great sweet mother. Come and look. Stephen stood up and went over to the parapet. Leaning on it he looked down on the water and on the mailboat clearing the harbourmouth of Kingstown. —Our mighty mother! Buck Mulligan said. He turned abruptly his grey searching eyes from the sea to Stephen’s face. —The aunt thinks you killed your mother, he said. That’s why she won’t let me have anything to do with you. —Someone killed her, Stephen said gloomily. —You could have knelt down, damn it, Kinch, when your dying mother asked you, Buck Mulligan said. I’m hyperborean as much as you. But to think of your mother begging you with her last breath to kneel down and pray for her. And you refused. There is something sinister in you... He broke off and lathered again lightly his farther cheek. A tolerant smile curled his lips. —But a lovely mummer! he murmured to himself. Kinch, the loveliest mummer of them all! He shaved evenly and with care, in silence, seriously. Stephen, an elbow rested on the jagged granite, leaned his palm against his brow and gazed at the fraying edge of his shiny black coat-sleeve. Pain, that was not yet the pain of love,"
[2] "fretted his heart. Silently, in a dream she had come to him after her death, her wasted body within its loose brown graveclothes giving off an odour of wax and rosewood, her breath, that had bent upon him, mute, reproachful, a faint odour of wetted ashes. Across the threadbare cuffedge he saw the sea hailed as a great sweet mother by the wellfed voice beside him. The ring of bay and skyline held a dull green mass of liquid. A bowl of white china had stood beside her deathbed holding the green sluggish bile which she had torn up from her rotting liver by fits of loud groaning vomiting. Buck Mulligan wiped again his razorblade. —Ah, poor dogsbody! he said in a kind voice. I must give you a shirt and a few noserags. How are the secondhand breeks? —They fit well enough, Stephen answered. Buck Mulligan attacked the hollow beneath his underlip. —The mockery of it, he said contentedly. Secondleg they should be. God knows what poxy bowsy left them off. I have a lovely pair with a hair stripe, grey. You’ll look spiffing in them. I’m not joking, Kinch. You look damn well when you’re dressed. —Thanks, Stephen said. I can’t wear them if they are grey. —He can’t wear them, Buck Mulligan told his face in the mirror. Etiquette is etiquette. He kills his mother but he can’t wear grey trousers. He folded his razor neatly and with stroking palps of fingers felt the smooth skin. Stephen turned his gaze from the sea and to the plump face with its smokeblue mobile eyes. —That fellow I was with in the Ship last night, said Buck Mulligan, says you have g.p.i. He’s up in Dottyville with Connolly Norman. General paralysis of the insane! He swept the mirror a half circle in the air to flash the tidings abroad in sunlight now radiant on the sea. His curling shaven lips laughed and the edges of his white glittering teeth. Laughter seized all his strong wellknit trunk. —Look at yourself, he said, you dreadful bard! 
Stephen bent forward and peered at the mirror held out to him, cleft by a crooked crack. Hair on end. As he and others see me. Who chose this face for me? This dogsbody to rid of vermin. It asks me too. —I pinched it out of the skivvy’s room, Buck Mulligan said. It does her all right. The aunt always keeps plainlooking servants for Malachi. Lead him not into temptation. And her name is Ursula. Laughing again, he brought the mirror away from Stephen’s peering eyes. —The rage of Caliban at not seeing his face in a mirror, he said. If Wilde were only alive to see you! Drawing back and pointing, Stephen said with bitterness: —It is a symbol of Irish art. The cracked looking-glass of a servant. Buck Mulligan suddenly linked his arm in Stephen’s and walked with him round the tower, his razor and mirror clacking in the pocket where he had thrust them. —It’s not fair to tease you like that, Kinch, is it? he said kindly. God knows you have more spirit than any of them. Parried again. He fears the lancet of my art as I fear that of his. The cold steel pen. —Cracked lookingglass of a servant! Tell that to the oxy chap downstairs and touch him for a guinea. He’s stinking with money and thinks you’re not a gentleman. His old fellow made his tin by selling jalap to Zulus or some bloody swindle or other. God, Kinch, if you and I could only work together we might do something for the island. Hellenise it. Cranly’s arm. His arm. —And to think of your having to beg from these swine. I’m the only one that knows what you are. Why don’t you trust me more? What have you up your nose against me? Is it Haines? If he makes any noise here I’ll bring down Seymour and we’ll give him a ragging worse than they gave Clive Kempthorpe. Young shouts of moneyed voices in Clive Kempthorpe’s rooms. Palefaces: they hold their ribs with laughter, one clasping another. O, I shall expire! Break the news to her gently, Aubrey! I shall die! 
With slit ribbons of his shirt whipping the air he hops and hobbles round the table, with trousers down at heels, chased by Ades of Magdalen with the tailor’s shears. A scared calf’s face gilded with marmalade. I don’t want to be debagged! Don’t you play the giddy ox with me! Shouts from the open window startling evening in the quadrangle. A deaf gardener, aproned, masked with Matthew Arnold’s face, pushes his mower on the sombre lawn watching narrowly the dancing motes of grasshalms. To ourselves... new paganism... omphalos. —Let him stay, Stephen said. There’s nothing wrong with him except at night. —Then what is it? Buck Mulligan asked impatiently. Cough it up. I’m quite frank with you. What have you against me now? They halted, looking towards the blunt cape of Bray Head that lay on the water like the snout of a sleeping whale. Stephen freed his arm quietly. —Do you wish me to tell you? he asked. —Yes, what is it? Buck Mulligan answered. I don’t remember anything. He looked in Stephen’s face as he spoke. A light wind passed his brow, fanning softly his fair uncombed hair and stirring silver points of anxiety in his eyes. Stephen, depressed by his own voice, said: —Do you remember the first day I went to your house after my mother’s death? Buck Mulligan frowned quickly and said: —What? Where? I can’t remember anything. I remember only ideas and sensations. Why? What happened in the name of God? —You were making tea, Stephen said, and went across the landing to get more hot water. Your mother and some visitor came out of the drawingroom. She asked you who was in your room. —Yes? Buck Mulligan said. What did I say? I forget. —You said,"                                                                                                                                                            
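`split_even` is defined in `utils.R`, which is not shown here. Judging from the call `split_even(doclines, 1000)` and the two long chunks printed above, it appears to join the lines and divide the text into a given number of roughly equal chunks; a minimal sketch under that assumption:

```r
# Hypothetical sketch of split_even from utils.R (assumption: join the lines
# into one string, then split the words into n roughly equal chunks).
split_even <- function(doclines, n) {
  words <- unlist(strsplit(paste(doclines, collapse = " "), "\\s+"))
  words <- words[nzchar(words)]                      # drop empty tokens
  idx <- cut(seq_along(words), breaks = n, labels = FALSE)
  unname(tapply(words, idx, paste, collapse = " "))  # n chunks of text
}
```

With this reading, `split_even(doclines, 1000)` yields 1000 chunks of roughly equal word count.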
In [405]:
# Read the files in a given directory, split them into sentences, and save the results
target_dir <- "./data/woolf"
target_files <- list.files(target_dir, "txt")

for (file in target_files) {
    doc <- readLines(file.path(target_dir, file))

    sents <- split_sentences(doclines = doc)
    sents <- unlist(sents)

    filename_1 <- gsub("(.*)(\\.txt)", "\\1_sent\\2", file)
    write(sents, file.path(target_dir, filename_1))

    even_splits <- split_even(doc, 50)
    
    filename_2 <- gsub("(.*)(\\.txt)", "\\1_even\\2", file)
    write(even_splits, file.path(target_dir, filename_2))
}
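`split_sentences` also comes from `utils.R` and is not shown; a minimal sketch, assuming it splits each document on sentence-final punctuation:

```r
# Hypothetical sketch of split_sentences from utils.R (assumption: split on
# whitespace that follows ., !, or ?).
split_sentences <- function(doclines) {
  text <- paste(doclines, collapse = " ")
  strsplit(text, "(?<=[.!?])\\s+", perl = TRUE)   # list of sentence vectors
}
```

This returns a list, which is why the loop above calls `unlist(sents)` before writing.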
In [ ]:

Building embedding models¶
  • Input: text files with roughly one sentence or one paragraph per line
  • Read the files, normalize them (lowercase), and combine them into a single document
  • Build the embedding model
In [72]:
library(word2vec)
In [249]:
# Read the text files and convert everything to lowercase.
# This example reads all of the works in the woolf directory at once and treats them as a single document.
woolf_files <- list.files("./data/woolf", "even.*txt", full.names = TRUE)

# Using lapply instead of a for loop; a for loop would work just as well.
woolf_texts <- lapply(woolf_files, function(x) {
    doc <- readLines(x)
    doc <- tolower(doc)
    return(doc)
})

woolf_texts <- unlist(woolf_texts)
In [251]:
# set.seed(10)

# Build an embedding model from woolf_texts with the word2vec function
woolf_model <- word2vec(x = woolf_texts, 
                        type = "skip-gram", 
                        dim = 50, 
                        window = 5, 
                        negative = 5, 
                        iter = 100,
                        threads = 8)
In [ ]:
# Inspect the structure of the resulting model
str(woolf_model)
In [406]:
# Save the model
write.word2vec(woolf_model, "./analysis/embeddings/woolf_model.bin")
TRUE
In [73]:
# Load the saved model (later cells use woolf_model_loaded)
woolf_model_loaded <- read.word2vec("./analysis/embeddings/woolf_model.bin")
In [294]:
# Extracting similar words (a single word)
preds <- predict(woolf_model, 'queer', type = "nearest")
print(preds)
$queer
   term1       term2 similarity rank
1  queer         odd  0.8386722    1
2  queer        kind  0.8193455    2
3  queer     strange  0.7842509    3
4  queer     chuckle  0.7828981    4
5  queer  disturbing  0.7712168    5
6  queer      sickly  0.7691061    6
7  queer       vague  0.7686276    7
8  queer frightening  0.7667588    8
9  queer     amusing  0.7642149    9
10 queer spontaneous  0.7588287   10

In [281]:
# Extracting similar words (two or more words)
preds <- predict(woolf_model, c('lady', 'gentleman'), type = "nearest")
print(preds)
$lady
   term1     term2 similarity rank
1   lady     queen  0.8267298    1
2   lady gentleman  0.8171982    2
3   lady    friend  0.8119703    3
4   lady   walpole  0.8112606    4
5   lady    prince  0.8082371    5
6   lady      lord  0.7913446    6
7   lady   dorothy  0.7849203    7
8   lady      earl  0.7843787    8
9   lady     lover  0.7814237    9
10  lady  daughter  0.7805042   10

$gentleman
       term1    term2 similarity rank
1  gentleman  servant  0.8340471    1
2  gentleman     lady  0.8171982    2
3  gentleman      man  0.8027343    3
4  gentleman     girl  0.7983292    4
5  gentleman princess  0.7849668    5
6  gentleman    woman  0.7830075    6
7  gentleman   haired  0.7782508    7
8  gentleman     maid  0.7748181    8
9  gentleman    young  0.7730316    9
10 gentleman    tweed  0.7597733   10

Visualizing the embedding model¶
  • Reduce dimensionality with the t-SNE algorithm (50 dimensions -> 2) and plot as an XY graph
In [88]:
# Load the required libraries
library(Rtsne)
library(ggplot2)
library(ggrepel)
library(plotly)
Attaching package: ‘plotly’


The following object is masked from ‘package:ggplot2’:

    last_plot


The following object is masked from ‘package:stats’:

    filter


The following object is masked from ‘package:graphics’:

    layout


In [120]:
# Convert the woolf model into an embedding matrix
woolf_embedding <- as.matrix(woolf_model_loaded)

print(dim(woolf_embedding))
cat("----------------\n")
print(woolf_embedding[1:3, 1:5])
[1] 11704    50
----------------
               [,1]     [,2]       [,3]       [,4]       [,5]
brompton -0.4057100 1.256959 -1.4611098 -2.2586901 -0.9518408
trailed  -0.5232229 0.356214 -1.1081522  0.7148921 -0.7097381
scope     0.3262754 1.411586  0.5579904 -0.5022139 -0.6621220
In [85]:
# Reduce 50 dimensions to 2 with the Rtsne function, yielding an x, y matrix
dim_redu <- Rtsne(woolf_embedding, dims = 2, pca = TRUE)
viz <- dim_redu$Y
print(head(viz))
            [,1]       [,2]
[1,]  -9.0640912   2.687415
[2,]   0.7679984  11.271722
[3,]  10.9026046 -12.119877
[4,]  13.5127512  19.770924
[5,] -13.2992564 -10.646265
[6,]  -8.3744385   7.362357
In [84]:
# Visualize only the first 50 words of the vocabulary
plot(viz[1:50,], t = "n")
text(viz[1:50,], labels = rownames(woolf_embedding)[1:50])
Finding words similar to a given input word¶
  • Get the embedding values of words similar to 'queer'
In [79]:
# Extract words similar to 'queer'
queer_similar_words <- predict(woolf_model_loaded, 'queer', type = "nearest", top_n = 20)[[1]]$term2
print(queer_similar_words)
 [1] "odd"          "kind"         "strange"      "chuckle"      "disturbing"  
 [6] "sickly"       "vague"        "frightening"  "amusing"      "spontaneous" 
[11] "frightened"   "disagreeable" "curious"      "charming"     "amusement"   
[16] "sad"          "terrifying"   "disliked"     "oddly"        "sinister"    
In [80]:
# Get the row indices for 'queer' and its similar words
queer_id <- which(rownames(woolf_embedding) == "queer")

queer_sims_ids <- sapply(queer_similar_words, function(x) {
                    which(rownames(woolf_embedding) == x)
                   }, USE.NAMES = FALSE)

print(c(queer_id, queer_sims_ids))
 [1]   368  2448  7496  2955  8196  3312  3781 11512  7441  8872 10932  1833
[13]  7939   704  7039  8830  8210 11096  3609  1961  6071
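The `sapply`/`which` lookup above can be written as a single vectorized `match` call; a small sketch with a toy vocabulary (in the notebook the equivalent would be `match(c("queer", queer_similar_words), rownames(woolf_embedding))`):

```r
# match() returns, for each query word, its first index in the vocabulary,
# replacing the sapply(..., which(...)) pattern in one call.
vocab <- c("brompton", "trailed", "scope", "queer", "odd")
ids <- match(c("queer", "odd"), vocab)
print(ids)
```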
In [83]:
# Extract the 2-D coordinates of those words by row index
queer_embeddings <- viz[c(queer_id, queer_sims_ids),]
print(head(queer_embeddings))
          [,1]       [,2]
[1,] 10.307106  -4.927119
[2,] 10.214328  -4.861577
[3,] 11.589360 -13.301554
[4,] 12.701972  -1.232904
[5,] 10.220544  -4.735102
[6,]  8.919012 -10.750757
In [86]:
# Plot the extracted coordinates
plot(queer_embeddings, t = "n")
text(queer_embeddings, labels = c("queer", queer_similar_words))
In [104]:
# Another plot: spread out overlapping labels
df_ <- data.frame(word = c("queer", queer_similar_words),
                  X = queer_embeddings[, 1], 
                  Y = queer_embeddings[, 2])

ggplot(df_, aes(x = X, y = Y, label = word)) +
    geom_point() +
    geom_text_repel(max.overlaps = Inf) +
    labs(title = "word2vec_woolf_queer") +
    theme_minimal()
In [105]:
# Render as an interactive plot
plot_ly(df_, x = ~X, y = ~Y, type = "scatter", mode = "text", text = ~word)
In [106]:
# Save the interactive plot as an HTML file
library(htmlwidgets)

fig <- plot_ly(df_, x = ~X, y = ~Y, type = "scatter", mode = "text", text = ~word)

saveWidget(widget = fig, #the plotly object
           file = "./analysis/figures/queer_embeddings.html", #the path & file name
           selfcontained = TRUE) #creates a single html file
In [109]:
# Try dimensionality reduction with a different algorithm (UMAP)
library(uwot)
In [108]:
viz <- umap(woolf_embedding)
print(head(viz))
               [,1]       [,2]
brompton  1.0314549 -1.2062538
trailed   2.7801361  0.6954176
scope    -1.8108150  2.1450133
stripe    2.7365763  0.8246590
bullied  -1.6193800 -0.9488240
lewes    -0.9114819 -0.7766817
In [110]:
queer_embeddings <- viz[c("queer", queer_similar_words),]
print(head(queer_embeddings))
                 [,1]     [,2]
queer      -0.5117765 1.136436
odd        -0.4817193 1.002571
kind       -1.4693935 1.746537
strange    -0.3223265 1.053952
chuckle    -0.5028764 1.089082
disturbing -1.1016185 1.648079
In [111]:
df_ <- data.frame(word = rownames(queer_embeddings),
                  x = queer_embeddings[, 1], 
                  y = queer_embeddings[, 2])
In [112]:
ggplot(df_, aes(x = x, y = y, label = word)) +
    geom_point() +
    geom_text_repel(size = 3, max.overlaps = Inf) + 
    labs(title = "word2vec_queer_woolf") +
    theme_minimal()
In [113]:
# Render as an interactive plot
plot_ly(df_, x = ~x, y = ~y, type = "scatter", mode = "text", text = ~word)
Building and saving embeddings for three more authors: Lawrence, Stein, and Joyce¶
In [114]:
# Read all texts by the three authors
author_dirs <- c("./data/lawrence", "./data/stein", "./data/joyce")
author_files <- lapply(author_dirs, function(x) list.files(x, "even.*txt", full.names = TRUE))

# Store the texts in a list, one element per author
author_texts <- list()
for (author in author_files) {
    texts <- lapply(author, function(x) {
        doc <- readLines(x)
        doc <- tolower(doc)
        return(doc)
    })

    texts <- unlist(texts)
    author_texts[[length(author_texts)+1]] <- texts
}
In [115]:
print(author_texts[[2]][1:3])
[1] "the good anna    part i   the tradesmen of bridgepoint learned to dread the sound of \"miss mathilda\", for with that name the good anna always conquered.  the strictest of the one price stores found that they could give things for a little less, when"               
[2] "the good anna had fully said that \"miss mathilda\" could not pay so much and that she could buy it cheaper \"by lindheims.\"  lindheims was anna's favorite store, for there they had bargain days, when flour and sugar were sold for a quarter of a cent less for a"    
[3] "pound, and there the heads of the departments were all her friends and always managed to give her the bargain prices, even on other days.  anna led an arduous and troubled life.  anna managed the whole little house for miss mathilda. it was a funny little house, one"
In [4]:
# Build an embedding model for each author from the texts read above
authors <- c("Lawrence", "Stein", "Joyce")  # author names, used in the progress messages

# Per-author embedding modeling
author_models <- list()
for (i in 1:length(author_texts)) {
    message(paste(authors[i], "processing start---"))

    author_model <- word2vec(x = author_texts[[i]], 
                             type = "skip-gram", 
                             dim = 50, 
                             window = 5, 
                             negative = 5, 
                             iter = 200,
                             threads = 8)

    author_models[[authors[i]]] <- author_model

    message(paste(authors[i], "processing ended---"))
}
Lawrence processing start---

Lawrence processing ended---

Stein processing start---

Stein processing ended---

Joyce processing start---

Joyce processing ended---

In [ ]:
# Check one of the resulting models
print(predict(author_models[[1]], "queer", type = "nearest"))
In [6]:
# Save the models
write.word2vec(author_models[[1]], "./analysis/embeddings/lawrence_embeddings.bin")
write.word2vec(author_models[[2]], "./analysis/embeddings/stein_embeddings.bin")
write.word2vec(author_models[[3]], "./analysis/embeddings/joyce_embeddings.bin")
TRUE
TRUE
TRUE
Comparing positive/negative word embedding vectors across authors¶
  • Read the positive/negative word lists
  • Load the embedding models for all authors
  • Extract the 100 words most similar to 'queer'
In [116]:
# Load the library needed for similarity computation
library(lsa)
In [16]:
# Read the positive/negative word lists
positive_lex <- read.csv("./opinion-lexicon/positive-words.txt", header = FALSE, comment.char = ";")
negative_lex <- read.csv("./opinion-lexicon/negative-words.txt", header = FALSE, comment.char = ";")

print(positive_lex[1:3,])
[1] "a+"      "abound"  "abounds"
In [131]:
# Load the per-author embedding models
authors <- c("joyce", "lawrence", "stein", "woolf")

target_dir <- "./analysis/embeddings"
embedding_files <- sort(list.files(target_dir, "bin", full.names = TRUE))

author_models <- list()
for (i in 1:length(embedding_files)) {
    embed <- read.word2vec(embedding_files[i])
    author_models[[authors[i]]] <- embed
}

author_embeddings <- list()
for (i in 1:length(author_models)) {
    embed <- as.matrix(author_models[[i]])
    author_embeddings[[authors[i]]] <- embed
}
In [140]:
# Extract the 100 words most similar to 'queer'
authors <- c("joyce", "lawrence", "stein", "woolf")

queer_nearest_words <- list()
for (i in 1:length(author_embeddings)) {
    nearest_words <- predict(author_models[[i]], newdata = "queer", type = "nearest", top_n = 100)
    queer_nearest_words[[authors[i]]] <- nearest_words[[1]]
}
In [141]:
# Compute the share of positive and negative words among each author's queer neighbors
joyce_pos_words <- queer_nearest_words[['joyce']]$term2[queer_nearest_words[['joyce']]$term2 %in% positive_lex$V1]
joyce_neg_words <- queer_nearest_words[['joyce']]$term2[queer_nearest_words[['joyce']]$term2 %in% negative_lex$V1]

lawrence_pos_words <- queer_nearest_words[['lawrence']]$term2[queer_nearest_words[['lawrence']]$term2 %in% positive_lex$V1]
lawrence_neg_words <- queer_nearest_words[['lawrence']]$term2[queer_nearest_words[['lawrence']]$term2 %in% negative_lex$V1]

stein_pos_words <- queer_nearest_words[['stein']]$term2[queer_nearest_words[['stein']]$term2 %in% positive_lex$V1]
stein_neg_words <- queer_nearest_words[['stein']]$term2[queer_nearest_words[['stein']]$term2 %in% negative_lex$V1]

woolf_pos_words <- queer_nearest_words[['woolf']]$term2[queer_nearest_words[['woolf']]$term2 %in% positive_lex$V1]
woolf_neg_words <- queer_nearest_words[['woolf']]$term2[queer_nearest_words[['woolf']]$term2 %in% negative_lex$V1]

print(woolf_pos_words)
cat("-----------------\n")
print(woolf_neg_words)
 [1] "amusing"         "spontaneous"     "charming"        "astonishingly"  
 [5] "attractive"      "smile"           "pretty"          "awe"            
 [9] "wonderful"       "astonishing"     "magnificent"     "amiable"        
[13] "nice"            "sharp"           "modest"          "sensitive"      
[17] "bright"          "romantic"        "humble"          "prominent"      
[21] "extraordinarily"
 [1] "odd"          "strange"      "disturbing"   "sickly"       "vague"       
 [6] "frightening"  "disagreeable" "sad"          "disliked"     "oddly"       
[11] "sinister"     "solicitude"   "painful"      "awfully"      "glum"        
[16] "ominous"      "unpleasant"   "distasteful"  "sly"          "incongruous" 
[21] "distaste"     "suspiciously" "mystery"      "pathetic"     "unnecessary" 
[26] "oddest"       "unusual"      "melancholy"   "flimsy"       "alarming"    
[31] "boredom"      "object"      
In [142]:
# Count the positive and negative words per author
print(paste("The number of positive words in Joyce is", length(joyce_pos_words)))
print(paste("The number of negative words in Joyce is", length(joyce_neg_words)))
print(paste("The number of positive words in Lawrence is", length(lawrence_pos_words)))
print(paste("The number of negative words in Lawrence is", length(lawrence_neg_words)))
print(paste("The number of positive words in Stein is", length(stein_pos_words)))
print(paste("The number of negative words in Stein is", length(stein_neg_words)))
print(paste("The number of positive words in Woolf is", length(woolf_pos_words)))
print(paste("The number of negative words in Woolf is", length(woolf_neg_words)))
[1] "The number of positive words in Joyce is 8"
[1] "The number of negative words in Joyce is 16"
[1] "The number of positive words in Lawrence is 9"
[1] "The number of negative words in Lawrence is 32"
[1] "The number of positive words in Stein is 9"
[1] "The number of negative words in Stein is 29"
[1] "The number of positive words in Woolf is 21"
[1] "The number of negative words in Woolf is 32"
In [143]:
# Ratios of positive and negative words per author (out of 100)
print(paste("The ratio of positive words in Joyce is", length(joyce_pos_words)/100))
print(paste("The ratio of negative words in Joyce is", length(joyce_neg_words)/100))
print(paste("The ratio of positive words in Lawrence is", length(lawrence_pos_words)/100))
print(paste("The ratio of negative words in Lawrence is", length(lawrence_neg_words)/100))
print(paste("The ratio of positive words in Stein is", length(stein_pos_words)/100))
print(paste("The ratio of negative words in Stein is", length(stein_neg_words)/100))
print(paste("The ratio of positive words in Woolf is", length(woolf_pos_words)/100))
print(paste("The ratio of negative words in Woolf is", length(woolf_neg_words)/100))
[1] "The ratio of positive words in Joyce is 0.08"
[1] "The ratio of negative words in Joyce is 0.16"
[1] "The ratio of positive words in Lawrence is 0.09"
[1] "The ratio of negative words in Lawrence is 0.32"
[1] "The ratio of positive words in Stein is 0.09"
[1] "The ratio of negative words in Stein is 0.29"
[1] "The ratio of positive words in Woolf is 0.21"
[1] "The ratio of negative words in Woolf is 0.32"
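The eight repeated `print` calls above could be factored into a small helper; a sketch (`sentiment_ratio` is a name introduced here, not from the notebook):

```r
# Share of a neighbor list that appears in the positive / negative lexicons.
sentiment_ratio <- function(neighbors, pos, neg) {
  c(pos = mean(neighbors %in% pos), neg = mean(neighbors %in% neg))
}

# Toy example: two of four neighbors are positive, two negative.
sentiment_ratio(c("odd", "nice", "vague", "smile"),
                pos = c("nice", "smile"), neg = c("odd", "vague"))
```

In the notebook this could be looped over `names(queer_nearest_words)` with `positive_lex$V1` and `negative_lex$V1` as the lexicons.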
In [144]:
# Get the woolf embedding matrix for the vector-similarity computations below
woolf_embeddings <- as.matrix(author_embeddings[['woolf']])
In [168]:
# Find the words most similar to a given word; predict returns the query word's embedding matrix.
# Get the embedding vector for 'love' from the woolf model,
wv <- predict(author_models[['woolf']], newdata = "love", type = "embedding")
val <- wv[1,]

# or pull the same vector directly from the embedding matrix:
# val <- author_embeddings[['woolf']]['love',]

# Then find the vectors most similar to it among all embeddings; apply keeps this fast.
res <- apply(woolf_embeddings, 1, function(rw) cosine(x = val, y = rw))    # compare val against every row of the embedding matrix
print(sort(res, decreasing = TRUE)[1:10])
        love      passion    hypocrisy         envy   friendship      dislike 
   1.0000000    0.6378813    0.6300378    0.6239195    0.6081415    0.5962149 
passionately        death         lust        ‘life 
   0.5943222    0.5912219    0.5841448    0.5809732 
In [170]:
# Vector arithmetic on two word embeddings (matrix operations)
wv <- predict(author_models[['woolf']], newdata = c("love", "hate"), type = "embedding")

# Add the two embedding vectors, i.e. build a combined 'love' + 'hate' vector.
val <- wv['love',] + wv['hate',]
# print(val)
In [171]:
# Find the words whose vectors are closest to the combined 'love' + 'hate' vector.
# Vector similarity is measured with cosine similarity (the cosine function).
res <- apply(woolf_embeddings, 1, function(rw) cosine(x = val, y = rw))    # compare val against every row of the embedding matrix
print(sort(res, decreasing = TRUE)[1:10])
       love        hate    vanities        envy      suffer   hypocrisy 
  0.8849086   0.8849086   0.6769011   0.6766069   0.6338068   0.6229489 
     hatred    savagery    jealousy interfering 
  0.6086594   0.5857570   0.5846844   0.5801688 
In [154]:
# Compute the similarity between the queer vector and the vectors of the positive/negative words among its neighbors.
woolf_pos_sims <- apply(author_embeddings[['woolf']][woolf_pos_words,], 1, function(rw) cosine(x = woolf_embeddings['queer',], y = rw))
woolf_neg_sims <- apply(author_embeddings[['woolf']][woolf_neg_words,], 1, function(rw) cosine(x = woolf_embeddings['queer',], y = rw))

# Distance is the flip side of similarity: distance = 1 - similarity.
woolf_pos_dists <- 1 - woolf_pos_sims
woolf_neg_dists <- 1 - woolf_neg_sims

# Average the distances for the positive and negative words, then take the difference.
woolf_positivity_queer <- mean(woolf_neg_dists) - mean(woolf_pos_dists)
# woolf_positivity_queer <- sum(woolf_pos_sims) - sum(woolf_neg_sims)
print(woolf_positivity_queer)
[1] -0.02003773
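The score above simplifies algebraically: mean(1 - neg_sims) - mean(1 - pos_sims) = mean(pos_sims) - mean(neg_sims). A self-contained sketch of the computation, with cosine similarity written out by hand rather than via lsa's cosine (cos_sim and positivity are names introduced here):

```r
# Cosine similarity between two vectors.
cos_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Positivity of a target vector relative to matrices of positive/negative word vectors.
positivity <- function(target, pos_mat, neg_mat) {
  pos_sims <- apply(pos_mat, 1, cos_sim, b = target)
  neg_sims <- apply(neg_mat, 1, cos_sim, b = target)
  mean(pos_sims) - mean(neg_sims)   # = mean(neg_dists) - mean(pos_dists)
}

# Toy check: a target aligned with the positive vector and orthogonal
# to the negative vector scores +1.
positivity(c(1, 0), pos_mat = rbind(c(2, 0)), neg_mat = rbind(c(0, 3)))
```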
In [166]:
# Inspect the distances between queer and the positive/negative words.
print(woolf_pos_dists)
cat('--------------\n')
print(woolf_neg_dists)
        amusing     spontaneous        charming   astonishingly      attractive 
      0.4159755       0.4241790       0.4355073       0.4801285       0.4877095 
          smile          pretty             awe       wonderful     astonishing 
      0.4945430       0.4945625       0.4975180       0.5012383       0.5012868 
    magnificent         amiable            nice           sharp          modest 
      0.5096653       0.5102666       0.5120216       0.5221945       0.5297301 
      sensitive          bright        romantic          humble       prominent 
      0.5301137       0.5310545       0.5336006       0.5353009       0.5371554 
extraordinarily 
      0.5404406 
--------------
         odd      strange   disturbing       sickly        vague  frightening 
   0.2966291    0.3849504    0.4052247    0.4084758    0.4092115    0.4120809 
disagreeable          sad     disliked        oddly     sinister   solicitude 
   0.4322023    0.4518979    0.4589085    0.4614068    0.4628387    0.4719150 
     painful      awfully         glum      ominous   unpleasant  distasteful 
   0.4792584    0.4887541    0.4982792    0.5020226    0.5032926    0.5078179 
         sly  incongruous     distaste suspiciously      mystery     pathetic 
   0.5097191    0.5099150    0.5123862    0.5210352    0.5220025    0.5220147 
 unnecessary       oddest      unusual   melancholy       flimsy     alarming 
   0.5220989    0.5289497    0.5306460    0.5335118    0.5343147    0.5363234 
     boredom       object 
   0.5375488    0.5400248 
In [167]:
# Apply the same procedure to the other three authors.
# Build each author's embedding matrix.
joyce_embeddings <- as.matrix(author_embeddings[['joyce']])
lawrence_embeddings <- as.matrix(author_embeddings[['lawrence']])
stein_embeddings <- as.matrix(author_embeddings[['stein']])
In [172]:
# joyce
joyce_pos_sims <- apply(joyce_embeddings[joyce_pos_words,], 1, function(rw) cosine(x = joyce_embeddings['queer',], y = rw))
joyce_neg_sims <- apply(joyce_embeddings[joyce_neg_words,], 1, function(rw) cosine(x = joyce_embeddings['queer',], y = rw))
joyce_pos_dists <- 1 - joyce_pos_sims
joyce_neg_dists <- 1 - joyce_neg_sims

joyce_positivity_queer <- mean(joyce_neg_dists) - mean(joyce_pos_dists)
# joyce_positivity_queer <- sum(joyce_pos_sims) - sum(joyce_neg_sims)
print(joyce_positivity_queer)
[1] 0.004759233
In [177]:
print(joyce_pos_dists)
cat('--------------\n')
print(joyce_neg_dists)
  excited    lovely       awe    decent   satisfy      nice  thrilled     nicer 
0.4210885 0.4580258 0.4953935 0.5298472 0.5305891 0.5340911 0.5358230 0.5498801 

--------------
     smell      stale     coarse    strange      tired struggling   terrible 
 0.3861283  0.4546362  0.4965594  0.4984882  0.4993610  0.5057617  0.5064134 
    bother       foul      smelt      decay   stifling       wild       suck 
 0.5174079  0.5212522  0.5279060  0.5398718  0.5412109  0.5429565  0.5442462 
     worst irritation 
 0.5513480  0.5520768 
In [174]:
# lawrence
lawrence_pos_sims <- apply(lawrence_embeddings[lawrence_pos_words,], 1, function(rw) cosine(x = lawrence_embeddings['queer',], y = rw))
lawrence_neg_sims <- apply(lawrence_embeddings[lawrence_neg_words,], 1, function(rw) cosine(x = lawrence_embeddings['queer',], y = rw))
lawrence_pos_dists <- 1 - lawrence_pos_sims
lawrence_neg_dists <- 1 - lawrence_neg_sims

lawrence_positivity_queer <- mean(lawrence_neg_dists) - mean(lawrence_pos_dists)
# lawrence_positivity_queer <- sum(lawrence_pos_sims) - sum(lawrence_neg_sims)
print(lawrence_positivity_queer)
[1] 0.03514807
In [176]:
print(lawrence_pos_dists)
cat('--------------\n')
print(lawrence_neg_dists)
    prominent         sharp          grin          like        gentle 
    0.3306781     0.3568192     0.3839144     0.4404632     0.4567575 
        smile      humorous          rapt extraordinary 
    0.4769002     0.4932401     0.4941408     0.5149683 

--------------
            odd          savage          wicked        sinister          absurd 
      0.3695543       0.4073070       0.4165786       0.4202439       0.4243594 
       peculiar         strange            wild          oddest      melancholy 
      0.4427576       0.4464482       0.4465072       0.4520501       0.4552310 
        defiant        pathetic           faint       unnatural           funny 
      0.4570941       0.4587301       0.4641097       0.4721134       0.4751702 
          oddly        crumpled        wrinkled           blunt         jeering 
      0.4793590       0.4841202       0.4863810       0.4913699       0.4931961 
      dangerous           blind         haggard     domineering        devilish 
      0.4935949       0.4985599       0.4993771       0.5047609       0.5068476 
          sneer          shabby         mocking         vicious   irresponsible 
      0.5087988       0.5123263       0.5125555       0.5178180       0.5196422 
       terrible incomprehension 
      0.5215478       0.5231417 
In [178]:
# stein
stein_pos_sims <- apply(stein_embeddings[stein_pos_words,], 1, function(rw) cosine(x = stein_embeddings['queer',], y = rw))
stein_neg_sims <- apply(stein_embeddings[stein_neg_words,], 1, function(rw) cosine(x = stein_embeddings['queer',], y = rw))
stein_pos_dists <- 1 - stein_pos_sims
stein_neg_dists <- 1 - stein_neg_sims

stein_positivity_queer <- mean(stein_neg_dists) - mean(stein_pos_dists)
# stein_positivity_queer <- sum(stein_pos_sims) - sum(stein_neg_sims)
print(stein_positivity_queer)
[1] -0.0171236
In [179]:
print(stein_pos_dists)
cat('--------------\n')
print(stein_neg_dists)
         nice         humor    marvellous     wonderful         lover 
    0.4324507     0.4934363     0.5147753     0.5262983     0.5465836 
    cherished distinguished          fine           fun 
    0.5541693     0.5567429     0.5592445     0.5655816 
--------------
      strange          poor        brutal         nasty          ugly 
    0.4059059     0.4487226     0.4508060     0.4548882     0.4678780 
        funny       flighty     uncertain         badly           sad 
    0.4783487     0.4828184     0.4830050     0.4918080     0.5051616 
          bad         angry       ashamed     disgusted      strained 
    0.5085252     0.5085407     0.5138997     0.5145495     0.5155280 
         hard           fat         shame uncomfortable        poison 
    0.5190932     0.5224168     0.5230778     0.5265052     0.5301953 
         hurt     irritable      stubborn        broken          pale 
    0.5316128     0.5339198     0.5407190     0.5536029     0.5546054 
    expensive     difficult         death       unhappy 
    0.5564286     0.5569840     0.5630232     0.5640897 
Using a pre-trained embedding model¶
In [180]:
# Try the GloVe embedding vectors
pretrained_embeddings <- read.table("./embeddings/pretrained/glove6b/glove.6B.300d.csv", sep = ",", header = FALSE, row.names = 1)
pretrained_embeddings <- as.matrix(pretrained_embeddings)
print(dim(pretrained_embeddings))
[1] 400000    300
In [183]:
# Working with individual word vectors
# val <- pretrained_embeddings['england',]
val <- pretrained_embeddings['king',] - pretrained_embeddings['man',] + pretrained_embeddings['woman',]
# val <- pretrained_embeddings['guy',] - pretrained_embeddings['he',] + pretrained_embeddings['she',]
# print(val)
In [184]:
res <- apply(pretrained_embeddings, 1, function(rw) lsa::cosine(x = val, y = rw))    # compare val against every row of the embedding matrix
print(sort(res, decreasing = TRUE)[1:20])
     king     queen   monarch    throne  princess    mother  daughter   kingdom 
0.8065858 0.6896163 0.5575491 0.5565375 0.5518684 0.5142154 0.5133157 0.5025345 
   prince elizabeth      wife     crown     woman       her     royal     marry 
0.5017740 0.4908031 0.4840559 0.4728340 0.4675374 0.4504166 0.4489115 0.4381888 
  married    sister   husband        ii 
0.4308436 0.4290439 0.4238651 0.4199441 
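The apply/sort pattern above recurs in every query, so it could be wrapped in a small helper. A sketch: `nearest_words` is a hypothetical function (not part of the notebook's `utils.R`), and it computes cosine similarity in base R rather than calling `lsa::cosine`, for self-containedness:

```r
# Return the n words whose embeddings are most similar to a query vector.
nearest_words <- function(embeddings, query_vec, n = 10) {
  # cosine similarity of the query vector to every row of the matrix
  sims <- apply(embeddings, 1, function(rw)
    sum(query_vec * rw) / (sqrt(sum(query_vec^2)) * sqrt(sum(rw^2))))
  sort(sims, decreasing = TRUE)[1:n]
}

# Toy check: a three-word "vocabulary" in two dimensions.
m <- rbind(cat = c(1, 0), dog = c(0.9, 0.1), car = c(0, 1))
print(nearest_words(m, c(1, 0), n = 2))   # cat first, then dog
```

With the full matrix this would be called as, e.g., `nearest_words(pretrained_embeddings, val, n = 20)`, reproducing the query above.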
In [ ]:
# Expanding search terms (1)
fish <- pretrained_embeddings['fish',]
fishy <- sort(apply(pretrained_embeddings, 1, function(rw) lsa::cosine(x = fish, y = rw)), decreasing = TRUE)
fishy <- names(fishy)[1:10]
print(fishy)
In [ ]:
# Expanding search terms (2): search again with the mean vector of the expanded terms
comb_vals <- pretrained_embeddings[c("fish","salmon","tuna", "shrimp", "trout"),]
comb_vals <- apply(comb_vals, 2, mean)
expanded_fishy <- apply(pretrained_embeddings, 1, function(rw) lsa::cosine(x = comb_vals, y = rw))    # compare the mean vector against every row of the embedding matrix
print(sort(expanded_fishy, decreasing = TRUE)[1:50])
-------------------------------------------------------------------------¶